Predicting sentiment from product reviews

Fire up GraphLab Create


In [63]:
import graphlab

Read some product review data

Loading reviews for a set of baby products.


In [64]:
products = graphlab.SFrame('amazon_baby.gl/')

Let's explore this data together

The data includes the product name, the review text, and the rating of the review.


In [65]:
products.head()


Out[65]:
name | review | rating
Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3.0
Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0
Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0
Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0
Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0
A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0
Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0
Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0
Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0
[10 rows x 3 columns]

Build the word count vector for each review


In [66]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [67]:
products.head()


Out[67]:
name | review | rating | word_count
Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3.0 | {'and': 5L, 'stink': 1L, 'because': 1L, 'order ...
Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 | {'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ...
Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0 | {'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ...
Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0 | {'ingenious': 1L, 'and': 3L, 'love': 2L, ...
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | {'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ...
Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0 | {'and': 2L, 'cute': 1L, 'help': 2L, 'doll': 1L, ...
A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0 | {'shop': 1L, 'be': 1L, 'is': 1L, 'it': 1L, ' ...
Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0 | {'feeding,': 1L, 'and': 2L, 'all': 1L, 'right': ...
Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0 | {'and': 1L, 'help': 1L, 'give': 1L, 'is': 1L, ...
Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0 | {'journal.': 1L, 'all': 1L, 'standarad': 1L, ...
[10 rows x 4 columns]
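
To make the word_count column more concrete, here is a minimal plain-Python sketch of the kind of counting shown in the output above. It assumes the text is lowercased and split on whitespace only, which is consistent with keys such as 'parents!!' and 'journal.' keeping their punctuation. The helper name simple_count_words is hypothetical, and this is an illustration rather than GraphLab Create's actual implementation.

```python
from collections import Counter

def simple_count_words(text):
    """Hypothetical stand-in: lowercase, split on whitespace, count tokens."""
    return dict(Counter(text.lower().split()))

# A made-up review, not taken from the dataset.
example = "I love it and my kids love it. Great for parents!!"
counts = simple_count_words(example)

# Punctuation stays attached to tokens, so 'it' and 'it.' are separate keys.
print(sorted(counts.items()))
```

Because punctuation is not stripped in this sketch, 'it' and 'it.' are counted separately, which mirrors the punctuated keys visible in the table.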


In [88]:
graphlab.canvas.set_target('ipynb')

In [69]:
products['name'].show()


Examining the reviews for the most-reviewed product: 'Vulli Sophie the Giraffe Teether'


In [70]:
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether']

In [71]:
len(giraffe_reviews)


Out[71]:
785

In [72]:
giraffe_reviews['rating'].show(view='Categorical')


Build a sentiment classifier


In [73]:
products['rating'].show(view='Categorical')


Define what's a positive and a negative sentiment

We will ignore all reviews with a rating of 3, since they tend to express neutral sentiment. Reviews with a rating of 4 or higher will be considered positive, while those with a rating of 2 or lower will be considered negative.


In [74]:
#ignore all 3* reviews
products = products[products['rating'] != 3]

In [75]:
#positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4
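
The labelling rule used in these two cells can be sketched for a single rating in plain Python. The function name label_sentiment and the sample ratings are hypothetical and only illustrate the rule; the notebook itself applies the same logic column-wise on the SFrame.

```python
def label_sentiment(rating):
    """Return 1 for positive (>= 4), 0 for negative (<= 2), None for neutral (3)."""
    if rating == 3:
        return None  # neutral reviews are dropped, not labelled
    return 1 if rating >= 4 else 0

# Made-up ratings for illustration.
ratings = [5.0, 3.0, 1.0, 4.0, 2.0]

# Drop the neutral rating first, then label what remains.
kept = [r for r in ratings if label_sentiment(r) is not None]
labels = [label_sentiment(r) for r in kept]
```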

In [76]:
products.head()


Out[76]:
name | review | rating | word_count | sentiment
Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 | {'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ... | 1
Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0 | {'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ... | 1
Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0 | {'ingenious': 1L, 'and': 3L, 'love': 2L, ... | 1
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | {'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ... | 1
Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0 | {'and': 2L, 'cute': 1L, 'help': 2L, 'doll': 1L, ... | 1
A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0 | {'shop': 1L, 'be': 1L, 'is': 1L, 'it': 1L, ' ... | 1
Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0 | {'feeding,': 1L, 'and': 2L, 'all': 1L, 'right': ... | 1
Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0 | {'and': 1L, 'help': 1L, 'give': 1L, 'is': 1L, ... | 1
Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0 | {'journal.': 1L, 'all': 1L, 'standarad': 1L, ... | 1
Baby Tracker® - Daily Childcare Journal, ... | I love this journal and our nanny uses it ... | 4.0 | {'all': 1L, 'forget': 1L, 'just': 1L, "daughter ... | 1
[10 rows x 5 columns]

Let's train the sentiment classifier


In [77]:
train_data,test_data = products.random_split(.8, seed=0)

In [78]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)


PROGRESS: Logistic regression:

PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients    : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 1.635327     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 3.323665     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 3.933787     | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 4.531906     | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 5.124025     | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 5.722144     | 0.899991          | 0.825967            |
PROGRESS: | 10        | 18       | 1.000000  | 8.458692     | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Evaluate the sentiment model


In [79]:
sentiment_model.evaluate(test_data)


Out[79]:
{'accuracy': 0.916256305548883, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
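
The reported accuracy can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below copies the four counts from the output above and also derives precision and recall for the positive class; the variable names are illustrative and not part of the GraphLab Create API.

```python
# (target_label, predicted_label) -> count, copied from the confusion matrix above
confusion = {(1, 0): 1461, (0, 1): 1328, (0, 0): 4000, (1, 1): 26515}

tp = confusion[(1, 1)]  # positive reviews predicted positive
tn = confusion[(0, 0)]  # negative reviews predicted negative
fp = confusion[(0, 1)]  # negative reviews predicted positive
fn = confusion[(1, 0)]  # positive reviews predicted negative

total = tp + tn + fp + fn
accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)

print("accuracy=%.6f precision=%.4f recall=%.4f" % (accuracy, precision, recall))
```

The accuracy computed this way agrees with the 'accuracy' value returned by evaluate().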

In [80]:
sentiment_model.evaluate(test_data, metric='roc_curve')


Out[80]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +------------------+----------------+------------------+-------+------+
 |    threshold     |      fpr       |       tpr        |   p   |  n   |
 +------------------+----------------+------------------+-------+------+
 |       0.0        | 0.218938253012 | 0.00517007772944 | 28046 | 5312 |
 | 0.0010000000475  | 0.781061746988 |  0.994829922271  | 28046 | 5312 |
 | 0.00200000009499 | 0.741528614458 |  0.993653283891  | 28046 | 5312 |
 | 0.00300000002608 | 0.719314759036 |  0.992904514013  | 28046 | 5312 |
 | 0.00400000018999 | 0.703689759036 |  0.992405334094  | 28046 | 5312 |
 | 0.00499999988824 | 0.692582831325 |  0.992013121301  | 28046 | 5312 |
 | 0.00600000005215 | 0.682793674699 |  0.991513941382  | 28046 | 5312 |
 | 0.00700000021607 | 0.672251506024 |  0.991157384297  | 28046 | 5312 |
 | 0.00800000037998 | 0.662085843373 |  0.990836482921  | 28046 | 5312 |
 | 0.00899999961257 | 0.654743975904 |  0.990515581545  | 28046 | 5312 |
 +------------------+----------------+------------------+-------+------+
 [1001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
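
For readers unfamiliar with the fpr and tpr columns, the following plain-Python sketch shows how one point of a ROC curve can be computed from predicted probabilities and true labels. It assumes a review is predicted positive when its probability is at least the threshold; GraphLab Create's exact tie-breaking convention is not shown in this notebook, so treat that as an assumption. The function name roc_point and the toy data are hypothetical.

```python
def roc_point(probabilities, labels, threshold):
    """Return (fpr, tpr) when predicting positive for probability >= threshold."""
    p = sum(1 for y in labels if y == 1)  # number of actual positives
    n = len(labels) - p                   # number of actual negatives
    tp = fp = 0
    for prob, y in zip(probabilities, labels):
        if prob >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
    return fp / float(n), tp / float(p)

# Toy data for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 0, 1, 0]

point_half = roc_point(probs, labels, 0.5)
point_zero = roc_point(probs, labels, 0.0)
```

Sweeping the threshold from 0 to 1 and collecting these pairs traces out the curve; here p and n play the same role as the p and n columns in the table above.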

In [34]:
sentiment_model.show(view='Evaluation')


Applying the learned model to understand sentiment for Giraffe


In [35]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

In [36]:
giraffe_reviews.head()


Out[36]:
name | review | rating | word_count | predicted_sentiment
Vulli Sophie the Giraffe Teether ... | He likes chewing on all the parts especially the ... | 5.0 | {'and': 1L, 'all': 1L, 'because': 1L, 'it': 1L, ... | 0.999513023521
Vulli Sophie the Giraffe Teether ... | My son loves this toy and fits great in the diaper ... | 5.0 | {'and': 1L, 'right': 1L, 'help': 1L, 'just': 1L, ... | 0.999320678306
Vulli Sophie the Giraffe Teether ... | There really should be a large warning on the ... | 1.0 | {'and': 2L, 'all': 1L, 'latex.': 1L, 'being': ... | 0.013558811687
Vulli Sophie the Giraffe Teether ... | All the moms in my moms' group got Sophie for ... | 5.0 | {'and': 2L, 'one!': 1L, 'all': 1L, 'love': 1L, ... | 0.995769474148
Vulli Sophie the Giraffe Teether ... | I was a little skeptical on whether Sophie was ... | 5.0 | {'and': 3L, 'all': 1L, 'old': 1L, 'her.': 1L, ... | 0.662374415673
Vulli Sophie the Giraffe Teether ... | I have been reading about Sophie and was going ... | 5.0 | {'and': 6L, 'seven': 1L, 'already': 1L, 'love': ... | 0.999997148186
Vulli Sophie the Giraffe Teether ... | My neice loves her sophie and has spent hours ... | 5.0 | {'and': 4L, 'drooling,': 1L, 'love': 1L, 'her.': ... | 0.989190989536
Vulli Sophie the Giraffe Teether ... | What a friendly face! And those mesmerizing ... | 5.0 | {'and': 3L, 'chew': 1L, "don't": 1L, 'is': 1L, ... | 0.999563518413
Vulli Sophie the Giraffe Teether ... | We got this just for my son to chew on instea ... | 5.0 | {'chew': 2L, 'because': 1L, 'just': 2L, 'what': ... | 0.970160542725
Vulli Sophie the Giraffe Teether ... | My baby seems to like this toy, but I could ... | 3.0 | {'and': 2L, 'already': 1L, 'in': 1L, 'some': ... | 0.195367644588
[10 rows x 5 columns]

Sort the reviews based on the predicted sentiment and explore


In [37]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [38]:
giraffe_reviews.head()


Out[38]:
name | review | rating | word_count | predicted_sentiment
Vulli Sophie the Giraffe Teether ... | Sophie, oh Sophie, your time has come. My ... | 5.0 | {'giggles': 1L, 'all': 1L, "violet's": 2L, ... | 1.0
Vulli Sophie the Giraffe Teether ... | I'm not sure why Sophie is such a hit with the ... | 4.0 | {'peace': 1L, 'month': 1L, 'bright': 1L, ... | 0.999999999703
Vulli Sophie the Giraffe Teether ... | I'll be honest...I bought this toy because all the ... | 4.0 | {'all': 2L, 'pops': 1L, 'existence.': 1L, ... | 0.999999999392
Vulli Sophie the Giraffe Teether ... | We got this little giraffe as a gift from a ... | 5.0 | {'all': 2L, "don't": 1L, '(literally).so': 1L, ... | 0.99999999919
Vulli Sophie the Giraffe Teether ... | As a mother of 16month old twins; I bought ... | 5.0 | {'cute': 1L, 'all': 1L, 'reviews.': 2L, 'just': ... | 0.999999998657
Vulli Sophie the Giraffe Teether ... | Sophie the Giraffe is the perfect teething toy. ... | 5.0 | {'just': 2L, 'both': 1L, 'month': 1L, 'ears,': ... | 0.999999997108
Vulli Sophie the Giraffe Teether ... | Sophie la giraffe is absolutely the best toy ... | 5.0 | {'and': 5L, 'the': 1L, 'all': 1L, 'old': 1L, ... | 0.999999995589
Vulli Sophie the Giraffe Teether ... | My 5-mos old son took to this immediately. The ... | 5.0 | {'just': 1L, 'shape': 2L, 'mutt': 1L, '"dog': 1L, ... | 0.999999995573
Vulli Sophie the Giraffe Teether ... | My nephews and my four kids all had Sophie in ... | 5.0 | {'and': 4L, 'chew': 1L, 'all': 1L, 'perfect;': ... | 0.999999989527
Vulli Sophie the Giraffe Teether ... | Never thought I'd see my son French kissing a ... | 5.0 | {'giggles': 1L, 'all': 1L, 'out,': 1L, 'over': ... | 0.999999985069
[10 rows x 5 columns]

Most positive reviews for the giraffe


In [23]:
giraffe_reviews[0]['review']


Out[23]:
"Sophie, oh Sophie, your time has come. My granddaughter, Violet is 5 months old and starting to teeth. What joy little Sophie brings to Violet. Sophie is made of a very pliable rubber that is sturdy but not tough. It is quite easy for Violet to twist Sophie into unheard of positions to get Sophie into her mouth. The little nose and hooves fit perfectly into small mouths, and the drooling has purpose. The paint on Sophie is food quality.Sophie was born in 1961 in France. The maker had wondered why there was nothing available for babies and made Sophie from the finest rubber, phthalate-free on St Sophie's Day, thus the name was born. Since that time millions of Sophie's populate the world. She is soft and for babies little hands easy to grasp. Violet especially loves the bumpy head and horns of Sophie. Sophie has a long neck that easy to grasp and twist. She has lovely, sizable spots that attract Violet's attention. Sophie has happy little squeaks that bring squeals of delight from Violet. She is able to make Sophie squeak and that brings much joy. Sophie's smooth skin is soothing to Violet's little gums. Sophie is 7 inches tall and is the exact correct size for babies to hold and love.As you well know the first thing babies grasp, goes into their mouths- how wonderful to have a toy that stimulates all of the senses and helps with the issue of teething. Sophie is small enough to fit into any size pocket or bag. Sophie is the perfect find for babies from a few months to a year old. How wonderful to hear the giggles and laughs that emanate from babies who find Sophie irresistible. Viva La Sophie!Highly Recommended.  prisrob 12-11-09"

In [24]:
giraffe_reviews[1]['review']


Out[24]:
"I'm not sure why Sophie is such a hit with the little ones, but my 7 month old baby girl is one of her adoring fans.  The rubber is softer and more pleasant to handle, and my daughter has enjoyed chewing on her legs and the nubs on her head even before she started teething.  She also loves the squeak that Sophie makes when you squeeze her.  Not sure what it is but if Sophie is amongst a pile of her other toys, my daughter will more often than not reach for Sophie.  And I have the peace of mind of knowing that only edible and safe paints and materials have been used to make Sophie, as opposed to Bright Starts and other baby toys made in China.  Now that the research is out on phthalates and other toxic substances in baby toys, I think it's more important than ever to find good quality toys that are also safe for our babies to handle and put in their mouths.  Sophie is a must-have for every new mom in my opinion.  Even if your kid is one of the few that can take or leave her, it's worth a try.  Vulli, the makers of Sophie, also make natural rubber teething rings that my daughter loves as well."

Show most negative reviews for giraffe


In [25]:
giraffe_reviews[-1]['review']


Out[25]:
"My son (now 2.5) LOVED his Sophie, and I bought one for every baby shower I've gone to. Now, my daughter (6 months) just today nearly choked on it and I will never give it to her again. Had I not been within hearing range it could have been fatal. The strange sound she was making caught my attention and when I went to her and found the front curved leg shoved well down her throat and her face a purply/blue I panicked. I pulled it out and she vomited all over the carpet before screaming her head off. I can't believe how my opinion of this toy has changed from a must-have to a must-not-use. Please don't disregard any of the choking hazard comments, they are not over exaggerated!"

In [26]:
giraffe_reviews[-2]['review']


Out[26]:
"This children's toy is nostalgic and very cute. However, there is a distinct rubber smell and a very odd taste, yes I tried it, that my baby did not enjoy. Also, if it is soiled it is extremely difficult to clean as the rubber is a kind of porus material and does not clean well. The final thing is the squeaking device inside which stopped working after the first couple of days. I returned this item feeling I had overpaid for a toy that was defective and did not meet my expectations. Please do not be swayed by the cute packaging and hype surounding it as I was. One more thing, I was given a full refund from Amazon without any problem."

Exercise

1. Use .apply() to build a new feature with the counts for each of the selected_words:

In the notebook above, we created a column ‘word_count’ with the word counts for each review. Our first task is to add one new column to the products SFrame for each word in selected_words, holding the number of times that word appears in each review. In the process, we will see how the .apply() method can be combined with a Python function to create new columns (our features), which is an extremely useful concept to grasp!


In [15]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

In [81]:
def awesome_count(word_count):
    if 'awesome' in word_count:
        return word_count['awesome']
    return 0

products['awesome'] = products['word_count'].apply(awesome_count)

def great_count(word_count):
    if 'great' in word_count:
        return word_count['great']
    return 0

products['great'] = products['word_count'].apply(great_count)

def fantastic_count(word_count):
    if 'fantastic' in word_count:
        return word_count['fantastic']
    return 0

products['fantastic'] = products['word_count'].apply(fantastic_count)

def amazing_count(word_count):
    if 'amazing' in word_count:
        return word_count['amazing']
    return 0

products['amazing'] = products['word_count'].apply(amazing_count)

def love_count(word_count):
    if 'love' in word_count:
        return word_count['love']
    return 0

products['love'] = products['word_count'].apply(love_count)

def horrible_count(word_count):
    if 'horrible' in word_count:
        return word_count['horrible']
    return 0

products['horrible'] = products['word_count'].apply(horrible_count)

def bad_count(word_count):
    if 'bad' in word_count:
        return word_count['bad']
    return 0

products['bad'] = products['word_count'].apply(bad_count)

def terrible_count(word_count):
    if 'terrible' in word_count:
        return word_count['terrible']
    return 0

products['terrible'] = products['word_count'].apply(terrible_count)

def awful_count(word_count):
    if 'awful' in word_count:
        return word_count['awful']
    return 0

products['awful'] = products['word_count'].apply(awful_count)

def wow_count(word_count):
    if 'wow' in word_count:
        return word_count['wow']
    return 0

products['wow'] = products['word_count'].apply(wow_count)

def hate_count(word_count):
    if 'hate' in word_count:
        return word_count['hate']
    return 0

products['hate'] = products['word_count'].apply(hate_count)

In [82]:
# products['awesome'] = products['word_count'].apply(awesome_count)

In [83]:
# # Generalize function for apply

# def selected_words_count(word_count, word):
#     if word in word_count:
#         return word_count[word]
#     return 0

In [84]:
# for word in selected_words:
#     products[word] = products.apply(lambda x: selected_words_count(x['word_count'], word))
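
The eleven near-identical functions above can be generalized with a small function factory. The sketch below shows the idea on a plain Python dict so that it runs without GraphLab Create; make_word_counter and the sample row are hypothetical names and data. With an SFrame, the function returned by the factory would presumably be passed to products['word_count'].apply() in the same way as the hand-written functions, but that variant is not run here.

```python
def make_word_counter(word):
    """Return a function that looks up `word` in a word-count dict (0 if absent)."""
    def count(word_count):
        return word_count.get(word, 0)
    return count

selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

# A made-up word-count dict standing in for one row of the word_count column.
row = {'and': 3, 'love': 2, 'great': 1}

# One feature per selected word, computed by the matching counter function.
features = {word: make_word_counter(word)(row) for word in selected_words}
```

Binding the word inside the factory gives each counter its own copy of the word, which avoids surprises from a lambda that refers to a loop variable.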

In [85]:
products.head()


Out[85]:
name | review | rating | word_count | sentiment | awesome
Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 | {'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ... | 1 | 0
Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0 | {'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ... | 1 | 0
Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0 | {'ingenious': 1L, 'and': 3L, 'love': 2L, ... | 1 | 0
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | {'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ... | 1 | 0
Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0 | {'and': 2L, 'cute': 1L, 'help': 2L, 'doll': 1L, ... | 1 | 0
A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0 | {'shop': 1L, 'be': 1L, 'is': 1L, 'it': 1L, ' ... | 1 | 0
Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0 | {'feeding,': 1L, 'and': 2L, 'all': 1L, 'right': ... | 1 | 0
Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0 | {'and': 1L, 'help': 1L, 'give': 1L, 'is': 1L, ... | 1 | 0
Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0 | {'journal.': 1L, 'all': 1L, 'standarad': 1L, ... | 1 | 0
Baby Tracker® - Daily Childcare Journal, ... | I love this journal and our nanny uses it ... | 4.0 | {'all': 1L, 'forget': 1L, 'just': 1L, "daughter ... | 1 | 0
great fantastic amazing love horrible bad terrible awful wow hate
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 2.0 0 0 0.0 0 0 0
1.0 0.0 0.0 0.0 0 0 0.0 0 0 0
1.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 2.0 0 0 0.0 0 0 0
[10 rows x 16 columns]

  • Using the .sum() method on each of the new columns you created, answer the following questions: Out of the selected_words, which one is most used in the dataset? Which one is least used? Save these results to answer the quiz at the end.

In [18]:
# Print the total number of occurrences of each selected word across all reviews.
print 'Word count value:'

for word in selected_words:
    print '{0}: {1}'.format(word, products[word].sum())


Word count value:
awesome: 2002
great: 42420.0
fantastic: 873.0
amazing: 1305.0
love: 40277.0
horrible: 659
bad: 3197
terrible: 673.0
awful: 345
wow: 131
hate: 1057
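
Given these totals, the most and least used words can be picked out programmatically rather than by eye. The sketch below copies the totals printed above into a plain dict; the variable names are illustrative.

```python
# Totals copied from the output above.
totals = {
    'awesome': 2002, 'great': 42420, 'fantastic': 873, 'amazing': 1305,
    'love': 40277, 'horrible': 659, 'bad': 3197, 'terrible': 673,
    'awful': 345, 'wow': 131, 'hate': 1057,
}

most_used = max(totals, key=totals.get)   # word with the largest total
least_used = min(totals, key=totals.get)  # word with the smallest total

print("most used: %s, least used: %s" % (most_used, least_used))
```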

2. Create a new sentiment analysis model using only the selected_words as features:

In the IPython Notebook above, we used word counts for all words as features for our sentiment classifier. Now, you are just going to use the selected_words:


In [89]:
train_data,test_data = products.random_split(.8, seed=0)

In [90]:
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=selected_words,
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 11
PROGRESS: Number of unpacked features : 11
PROGRESS: Number of coefficients    : 12
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.269027     | 0.844299          | 0.842842            |
PROGRESS: | 2         | 3        | 0.450045     | 0.844186          | 0.842842            |
PROGRESS: | 3         | 4        | 0.594059     | 0.844276          | 0.843142            |
PROGRESS: | 4         | 5        | 0.760076     | 0.844269          | 0.843142            |
PROGRESS: | 5         | 6        | 0.942094     | 0.844269          | 0.843142            |
PROGRESS: | 6         | 7        | 1.159116     | 0.844269          | 0.843142            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
  • You will now examine the weights the learned classifier assigned to each of the 11 words in selected_words and gain intuition as to what the ML algorithm did for your data using these features. In GraphLab Create, a learned model, such as the selected_words_model, has a field 'coefficients', which lets you look at the learned coefficients. You can access it by using:

In [98]:
coef = selected_words_model['coefficients']
  • Using this approach, sort the learned coefficients according to the ‘value’ column using .sort(). Out of the 11 words in selected_words, which one got the most positive weight? Which one got the most negative weight? Do these values make sense to you? Save these results to answer the quiz at the end.

In [99]:
coef = coef.sort('value', ascending=False)
coef


Out[99]:
name index class value
love None 1 1.39989834302
(intercept) None 1 1.36728315229
awesome None 1 1.05800888878
amazing None 1 0.892802422508
fantastic None 1 0.891303090304
great None 1 0.883937894898
wow None 1 -0.0541450123333
bad None 1 -0.985827369929
hate None 1 -1.40916406276
awful None 1 -1.76469955631
[12 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [100]:
coef.sort('value', ascending=True)


Out[100]:
name index class value
terrible None 1 -2.09049998487
horrible None 1 -1.99651800559
awful None 1 -1.76469955631
hate None 1 -1.40916406276
bad None 1 -0.985827369929
wow None 1 -0.0541450123333
great None 1 0.883937894898
fantastic None 1 0.891303090304
amazing None 1 0.892802422508
awesome None 1 1.05800888878
[12 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
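
To build intuition for what these coefficients mean, the sketch below plugs them into the standard logistic regression formula: the score is the intercept plus the sum of each weight times the word's count, and the probability of positive sentiment is the sigmoid of that score. The coefficient values are copied from the two tables above; the function name predict_probability is hypothetical, and the assumption is that the model applies this standard form with no further transformation of the features.

```python
import math

# Coefficients copied from the two sorted tables above.
intercept = 1.36728315229
weights = {
    'love': 1.39989834302, 'awesome': 1.05800888878, 'amazing': 0.892802422508,
    'fantastic': 0.891303090304, 'great': 0.883937894898, 'wow': -0.0541450123333,
    'bad': -0.985827369929, 'hate': -1.40916406276, 'awful': -1.76469955631,
    'horrible': -1.99651800559, 'terrible': -2.09049998487,
}

def predict_probability(word_counts):
    """Sigmoid of intercept + sum(weight * count) over the selected words."""
    score = intercept + sum(weights[w] * word_counts.get(w, 0) for w in weights)
    return 1.0 / (1.0 + math.exp(-score))

p_none = predict_probability({})                  # no selected words present
p_love = predict_probability({'love': 2})         # 'love' appears twice
p_terrible = predict_probability({'terrible': 1}) # 'terrible' appears once
```

Under these assumptions, a review containing none of the selected words still gets a probability of roughly 0.8 because of the positive intercept, so such reviews are classified as positive by default.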

3. Comparing the accuracy of different sentiment analysis models:

  • What is the accuracy of the selected_words_model on the test_data? What was the accuracy of the sentiment_model that we learned using all the word counts in the IPython Notebook above from the lectures? What is the accuracy of the majority class classifier on this task? How do the different learned models compare with the baseline approach where we are just predicting the majority class? Save these results to answer the quiz at the end.

Hint: we discussed the majority class classifier in lecture, which simply predicts that every data point is from the most common class. This baseline is something we definitely want to beat with models we learn from data.


In [93]:
selected_words_model.evaluate(test_data)


Out[93]:
{'accuracy': 0.8431419649291376, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        0        |  234  |
 |      1       |        0        |  130  |
 |      0       |        1        |  5094 |
 |      1       |        1        | 27846 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}

In [62]:
sentiment_model.evaluate(test_data)


Out[62]:
{'accuracy': 0.916256305548883, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
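
The majority class baseline can be derived from the same confusion matrix, since its rows reveal how many test reviews are truly positive and truly negative. The sketch below copies the counts and the two reported accuracies from the outputs above; the variable names are illustrative.

```python
# (target_label, predicted_label) -> count, copied from the sentiment_model output above
confusion = {(1, 0): 1461, (0, 1): 1328, (0, 0): 4000, (1, 1): 26515}

positives = confusion[(1, 0)] + confusion[(1, 1)]  # reviews whose true label is 1
negatives = confusion[(0, 0)] + confusion[(0, 1)]  # reviews whose true label is 0
total = positives + negatives

# Always predicting the most common class is right for every review in that class.
majority_accuracy = max(positives, negatives) / float(total)

accuracies = {
    'majority_class': majority_accuracy,
    'selected_words_model': 0.8431419649291376,
    'sentiment_model': 0.916256305548883,
}
ranking = sorted(accuracies, key=accuracies.get)  # lowest accuracy first

for name in ranking:
    print("%s: %.4f" % (name, accuracies[name]))
```

On these numbers the selected_words_model is only marginally better than always predicting positive, while the model using all word counts clearly beats the baseline.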

In [94]:
selected_words_model.evaluate(test_data, metric='roc_curve')


Out[94]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +------------------+----------------+-------------------+-------+------+
 |    threshold     |      fpr       |        tpr        |   p   |  n   |
 +------------------+----------------+-------------------+-------+------+
 |       0.0        |      0.0       | 3.57040845473e-05 | 28008 | 5304 |
 | 0.0010000000475  |      1.0       |   0.999964295915  | 28008 | 5304 |
 | 0.00200000009499 | 0.999811463047 |   0.999928591831  | 28008 | 5304 |
 | 0.00300000002608 | 0.999811463047 |   0.999928591831  | 28008 | 5304 |
 | 0.00400000018999 | 0.999622926094 |   0.999928591831  | 28008 | 5304 |
 | 0.00499999988824 | 0.999622926094 |   0.999928591831  | 28008 | 5304 |
 | 0.00600000005215 | 0.99943438914  |   0.999928591831  | 28008 | 5304 |
 | 0.00700000021607 | 0.99943438914  |   0.999928591831  | 28008 | 5304 |
 | 0.00800000037998 | 0.99943438914  |   0.999928591831  | 28008 | 5304 |
 | 0.00899999961257 | 0.99943438914  |   0.999928591831  | 28008 | 5304 |
 +------------------+----------------+-------------------+-------+------+
 [1001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}

In [95]:
selected_words_model.show(view='Evaluation')


4. Interpreting the difference in performance between the models:

To understand why the model with all word counts performs better than the one with only the selected_words, we will now examine the reviews for a particular product.

  • We will investigate a product named ‘Baby Trend Diaper Champ’. (This is a trash can for soiled baby diapers, which keeps the smell contained.)

  • Just like we did for the reviews for the giraffe toy in the IPython Notebook in the lecture video, before we start our analysis you should select all reviews where the product name is ‘Baby Trend Diaper Champ’. Let’s call this table diaper_champ_reviews.

  • Again, just as in the video, use the sentiment_model to predict the sentiment of each review in diaper_champ_reviews and sort the results according to their ‘predicted_sentiment’.

  • What is the ‘predicted_sentiment’ for the most positive review for ‘Baby Trend Diaper Champ’ according to the sentiment_model from the IPython Notebook from lecture? Save this result to answer the quiz at the end.

  • Now use the selected_words_model you learned using just the selected_words to predict the sentiment of the most positive review you found above. Hint: if you sorted the diaper_champ_reviews in descending order (from most positive to most negative), this command will be helpful to make the prediction you need:


In [48]:
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

In [49]:
diaper_champ_reviews.head()


Out[49]:
name | review | rating | word_count | sentiment | awesome
Baby Trend Diaper Champ | Ok - newsflash. Diapers are just smelly. We've ... | 4.0 | {'just': 2L, 'less': 1L, '-': 3L, 'smell- ... | 1 | 0
Baby Trend Diaper Champ | My husband and I selected the Diaper "Champ" ma ... | 1.0 | {'just': 1L, 'less': 1L, 'when': 3L, 'over': 1L, ... | 0 | 0
Baby Trend Diaper Champ | Excellent diaper disposal unit. I used it in ... | 5.0 | {'control': 1L, 'am': 1L, 'it': 1L, 'used': 1L, ... | 1 | 0
Baby Trend Diaper Champ | We love our diaper champ. It is very easy to use ... | 5.0 | {'and': 3L, 'over.': 1L, 'all': 1L, 'love': 1L, ... | 1 | 0
Baby Trend Diaper Champ | Two girlfriends and two family members put me ... | 5.0 | {'just': 1L, 'when': 1L, 'both': 1L, 'results': ... | 1 | 0
Baby Trend Diaper Champ | I waited to review this until I saw how it ... | 4.0 | {'lysol': 1L, 'all': 1L, 'mom.': 1L, 'busy': 1L, ... | 1 | 0
Baby Trend Diaper Champ | I have had a diaper genie for almost 4 years since ... | 1.0 | {'all': 1L, 'bags.': 1L, 'just': 1L, "don't": 2L, ... | 0 | 0
Baby Trend Diaper Champ | I originally put this item on my baby registry ... | 5.0 | {'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ... | 1 | 0
Baby Trend Diaper Champ | I am so glad I got the Diaper Champ instead of ... | 5.0 | {'and': 2L, 'all': 1L, 'just': 1L, 'is': 2L, ... | 1 | 0
Baby Trend Diaper Champ | We had 2 diaper Genie's both given to us as a ... | 4.0 | {'hand.': 1L, '(required': 1L, ... | 1 | 0
great fantastic amazing love horrible bad terrible awful wow hate
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 1 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 1 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0
0.0 0.0 0.0 2.0 0 0 0.0 0 0 0
[10 rows x 16 columns]


In [50]:
diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')

In [51]:
diaper_champ_reviews = diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

In [52]:
diaper_champ_reviews.head()


Out[52]:
name review rating word_count sentiment awesome
Baby Trend Diaper Champ Baby Luke can turn a
clean diaper to a dirty ...
5.0 {'all': 1L, 'less': 1L,
"friend's": 1L, '(whi ...
1 0
Baby Trend Diaper Champ I LOOOVE this diaper
pail! Its the easies ...
5.0 {'just': 1L, 'over': 1L,
'rweek': 1L, 'sooo': 1L, ...
1 0
Baby Trend Diaper Champ We researched all of the
different types of di ...
4.0 {'all': 2L, 'just': 4L,
"don't": 2L, 'one,': 1L, ...
1 0
Baby Trend Diaper Champ My baby is now 8 months
and the can has been ...
5.0 {"don't": 1L, 'when': 1L,
'over': 1L, 'soon': 1L, ...
1 0
Baby Trend Diaper Champ This is absolutely, by
far, the best diaper ...
5.0 {'just': 3L, 'money': 1L,
'not': 2L, 'mechanism': ...
1 0
Baby Trend Diaper Champ Diaper Champ or Diaper
Genie? That was my ...
5.0 {'all': 1L, 'bags.': 1L,
'son,': 1L, '(i': 1L, ...
1 0
Baby Trend Diaper Champ Wow! This is fabulous.
It was a toss-up between ...
5.0 {'and': 4L, '"genie".':
1L, 'since': 1L, ...
1 0
Baby Trend Diaper Champ I originally put this
item on my baby registry ...
5.0 {'lysol': 1L, 'all': 2L,
'bags.': 1L, 'feedback': ...
1 0
Baby Trend Diaper Champ Two girlfriends and two
family members put me ...
5.0 {'just': 1L, 'when': 1L,
'both': 1L, 'results': ...
1 0
Baby Trend Diaper Champ I am one of those super-
critical shoppers who ...
5.0 {'taller': 1L, 'bags.':
1L, 'just': 1L, "don't": ...
1 0
great fantastic amazing love horrible bad terrible awful wow hate predicted_sentiment
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999937267
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0 0.999999917406
0.0 0.0 0.0 0.0 0 1 0.0 0 0 0 0.999999899509
2.0 0.0 0.0 0.0 0 1 0.0 0 0 0 0.999999836182
0.0 0.0 0.0 2.0 0 0 0.0 0 0 0 0.999999824745
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999759315
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999692111
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999642488
0.0 0.0 0.0 0.0 1 0 0.0 0 0 0 0.999999604504
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0 0.999999486804
[10 rows x 17 columns]


In [97]:
diaper_champ_reviews['predicted_sentiment'].max()


Out[97]:
0.9999999372669541

In [54]:
selected_words_model.predict(diaper_champ_reviews[0:1], output_type='probability')


Out[54]:
dtype: float
Rows: 1
[0.796940851290671]
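
The slice [0:1] in the call above matters. Assuming the usual SFrame indexing behavior (an assumption about GraphLab Create, not something this notebook demonstrates directly), an integer index such as diaper_champ_reviews[0] returns a single row as a plain dict, while a slice returns a one-row table that predict can consume. The plain-Python sketch below mirrors that distinction with a list of row dicts; the list-of-dicts layout is only an analogy, and the values are copied from the sorted output above.

```python
# Plain-Python analogy (no GraphLab required): a table as a list of row dicts.
# This layout is illustrative only; it is not how an SFrame stores data.
rows = [
    {'review': 'Baby Luke can turn a clean diaper to a dirty ...',
     'predicted_sentiment': 0.999999937267},
    {'review': 'I LOOOVE this diaper pail! Its the easies ...',
     'predicted_sentiment': 0.999999917406},
]

single_row = rows[0]        # integer index: one row, returned as a dict
one_row_table = rows[0:1]   # slice: still a table, holding exactly one row

print(type(single_row).__name__)     # dict
print(type(one_row_table).__name__)  # list
print(len(one_row_table))            # 1
```

The same reasoning is why the later cells can write diaper_champ_reviews[0]['review']: indexing with an integer yields one row whose columns are looked up by name.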

In [56]:
# Score every review with the selected-words model too, so both predictions sit side by side.
diaper_champ_reviews['predicted_sentiment_2'] = selected_words_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews.head()


Out[56]:
name review rating word_count sentiment awesome
Baby Trend Diaper Champ Baby Luke can turn a
clean diaper to a dirty ...
5.0 {'all': 1L, 'less': 1L,
"friend's": 1L, '(whi ...
1 0
Baby Trend Diaper Champ I LOOOVE this diaper
pail! Its the easies ...
5.0 {'just': 1L, 'over': 1L,
'rweek': 1L, 'sooo': 1L, ...
1 0
Baby Trend Diaper Champ We researched all of the
different types of di ...
4.0 {'all': 2L, 'just': 4L,
"don't": 2L, 'one,': 1L, ...
1 0
Baby Trend Diaper Champ My baby is now 8 months
and the can has been ...
5.0 {"don't": 1L, 'when': 1L,
'over': 1L, 'soon': 1L, ...
1 0
Baby Trend Diaper Champ This is absolutely, by
far, the best diaper ...
5.0 {'just': 3L, 'money': 1L,
'not': 2L, 'mechanism': ...
1 0
Baby Trend Diaper Champ Diaper Champ or Diaper
Genie? That was my ...
5.0 {'all': 1L, 'bags.': 1L,
'son,': 1L, '(i': 1L, ...
1 0
Baby Trend Diaper Champ Wow! This is fabulous.
It was a toss-up between ...
5.0 {'and': 4L, '"genie".':
1L, 'since': 1L, ...
1 0
Baby Trend Diaper Champ I originally put this
item on my baby registry ...
5.0 {'lysol': 1L, 'all': 2L,
'bags.': 1L, 'feedback': ...
1 0
Baby Trend Diaper Champ Two girlfriends and two
family members put me ...
5.0 {'just': 1L, 'when': 1L,
'both': 1L, 'results': ...
1 0
Baby Trend Diaper Champ I am one of those super-
critical shoppers who ...
5.0 {'taller': 1L, 'bags.':
1L, 'just': 1L, "don't": ...
1 0
great fantastic amazing love horrible bad terrible awful wow hate predicted_sentiment
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999937267
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0 0.999999917406
0.0 0.0 0.0 0.0 0 1 0.0 0 0 0 0.999999899509
2.0 0.0 0.0 0.0 0 1 0.0 0 0 0 0.999999836182
0.0 0.0 0.0 2.0 0 0 0.0 0 0 0 0.999999824745
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999759315
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999692111
0.0 0.0 0.0 0.0 0 0 0.0 0 0 0 0.999999642488
0.0 0.0 0.0 0.0 1 0 0.0 0 0 0 0.999999604504
0.0 0.0 0.0 1.0 0 0 0.0 0 0 0 0.999999486804
predicted_sentiment_2
0.796940851291
0.940876393428
0.5942241719
0.895606298305
0.984739056527
0.796940851291
0.796940851291
0.796940851291
0.347684052736
0.940876393428
[10 rows x 18 columns]
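
The predicted_sentiment_2 column repeats the value 0.796940851291 for every review whose selected-word counts are all zero. That is what a logistic model would produce: with every feature at zero the score reduces to the intercept, so the probability is sigmoid(intercept). The sketch below works backward from the printed probabilities, assuming selected_words_model is a logistic classifier over the selected-word counts (an assumption, since the training cell is not part of this section). The names logit, sigmoid, intercept, and love_coef are introduced here for illustration only.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the log-odds of probability p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Map a log-odds score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities copied from the predicted_sentiment_2 column above.
p_no_selected_words = 0.796940851291  # rows where every selected-word count is 0
p_love_once = 0.940876393428          # rows with love == 1 and all else 0
p_love_twice = 0.984739056527         # row with love == 2 and all else 0

# With all features at zero, the log-odds equal the intercept.
intercept = logit(p_no_selected_words)

# One occurrence of 'love' shifts the log-odds by that word's coefficient.
love_coef = logit(p_love_once) - intercept

# Consistency check: two occurrences should shift the log-odds twice as far.
p_love_twice_reconstructed = sigmoid(intercept + 2 * love_coef)

print('implied intercept: %.4f' % intercept)
print("implied 'love' coefficient: %.4f" % love_coef)
print('reconstructed p(love == 2): %.6f, printed: %.6f'
      % (p_love_twice_reconstructed, p_love_twice))
```

Under that assumption, the review that sentiment_model ranks as most positive receives only the baseline score from selected_words_model, because none of the selected words appears in it as a standalone token.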


In [58]:
diaper_champ_reviews[0]['review']


Out[58]:
'Baby Luke can turn a clean diaper to a dirty diaper in 3 seconds flat. The diaper champ turns the smelly diaper into "what diaper smell" in less time than that. I hesitated and wondered what I REALLY needed for the nursery. This is one of the best purchases we made. The champ, the baby bjorn, fluerville diaper bag, and graco pack and play bassinet all vie for the best baby purchase.Great product, easy to use, economical, effective, absolutly fabulous.UpdateI knew that I loved the champ, and useing the diaper genie at a friend\'s house REALLY reinforced that!! There is no comparison, the chanp is easy and smell free, the genie was difficult to use one handed (which is absolutly vital if you have a little one on a changing pad) and there was a deffinite odor eminating from the genieplus we found that the quick tie garbage bags where the ties are integrated into the bag work really well because there isn\'t any added bulk around the sealing edge of the champ.'

In [61]:
diaper_champ_reviews[0]['word_count']


Out[61]:
{'"what': 1L,
 '(which': 1L,
 '3': 1L,
 'a': 6L,
 'absolutly': 2L,
 'added': 1L,
 'all': 1L,
 'and': 6L,
 'any': 1L,
 'are': 1L,
 'around': 1L,
 'at': 1L,
 'baby': 3L,
 'bag': 1L,
 'bag,': 1L,
 'bags': 1L,
 'bassinet': 1L,
 'because': 1L,
 'best': 2L,
 'bjorn,': 1L,
 'bulk': 1L,
 'can': 1L,
 'champ': 1L,
 'champ,': 2L,
 'champ.': 1L,
 'changing': 1L,
 'chanp': 1L,
 'clean': 1L,
 'comparison,': 1L,
 'deffinite': 1L,
 'diaper': 7L,
 'difficult': 1L,
 'dirty': 1L,
 'easy': 2L,
 'economical,': 1L,
 'edge': 1L,
 'effective,': 1L,
 'eminating': 1L,
 'fabulous.updatei': 1L,
 'flat.': 1L,
 'fluerville': 1L,
 'for': 2L,
 'found': 1L,
 'free,': 1L,
 "friend's": 1L,
 'from': 1L,
 'garbage': 1L,
 'genie': 2L,
 'genieplus': 1L,
 'graco': 1L,
 'handed': 1L,
 'have': 1L,
 'hesitated': 1L,
 'house': 1L,
 'i': 3L,
 'if': 1L,
 'in': 2L,
 'integrated': 1L,
 'into': 2L,
 'is': 4L,
 "isn't": 1L,
 'knew': 1L,
 'less': 1L,
 'little': 1L,
 'loved': 1L,
 'luke': 1L,
 'made.': 1L,
 'needed': 1L,
 'no': 1L,
 'nursery.': 1L,
 'odor': 1L,
 'of': 2L,
 'on': 1L,
 'one': 3L,
 'pack': 1L,
 'pad)': 1L,
 'play': 1L,
 'product,': 1L,
 'purchase.great': 1L,
 'purchases': 1L,
 'quick': 1L,
 'really': 3L,
 'reinforced': 1L,
 'sealing': 1L,
 'seconds': 1L,
 'smell': 1L,
 'smell"': 1L,
 'smelly': 1L,
 'than': 1L,
 'that': 2L,
 'that!!': 1L,
 'that.': 1L,
 'the': 17L,
 'there': 3L,
 'this': 1L,
 'tie': 1L,
 'ties': 1L,
 'time': 1L,
 'to': 3L,
 'turn': 1L,
 'turns': 1L,
 'use': 1L,
 'use,': 1L,
 'useing': 1L,
 'vie': 1L,
 'vital': 1L,
 'was': 2L,
 'we': 2L,
 'well': 1L,
 'what': 1L,
 'where': 1L,
 'wondered': 1L,
 'work': 1L,
 'you': 1L}
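
The word_count keys above show why this glowing review has a zero in every selected-word column. The keys are consistent with lowercasing the text and splitting on whitespace only, so punctuation stays attached: 'purchase.Great' becomes the single token 'purchase.great', and 'loved' is a different key from 'love'. The sketch below reproduces that behavior on an excerpt of the review using collections.Counter. It approximates what count_words appears to do in this output and is not GraphLab's implementation. The selected_words list is taken from the column names in the tables above.

```python
from collections import Counter

# Excerpt copied from the review text shown above (typos are in the original).
excerpt = ('all vie for the best baby purchase.Great product, easy to use, '
           'economical, effective, absolutly fabulous.UpdateI knew that I '
           'loved the champ,')

# Assumed tokenization: lowercase, then split on whitespace (punctuation kept).
word_count = Counter(excerpt.lower().split())

# Column names from the tables above.
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

# A selected word is counted only when it appears as an exact key.
features = {word: word_count.get(word, 0) for word in selected_words}

print(features)
print(word_count['purchase.great'], word_count['loved'])
```

Every selected-word feature comes out as zero even though the excerpt says "Great product" and "loved", which is the information the selected-words model never sees.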

In [59]:
diaper_champ_reviews[1]['review']


Out[59]:
'I LOOOVE this diaper pail!  Its the easiest to use!  after using the diaper genie for 2 months i decided i had enough with the refils and with how much of a pain it is to use.  I purchases this diaper pail in its place and i loooove it!  No more refills, it uses the same bags as my kitchen garbage!  And it holds alot more! I only have to empty it like once a week as oppsed to every other day with the diaper genie.  This is worth the few extra buck because you arnt spending 5 more bucks every othe rweek for refills!  I have a bunch of poopy diapers in mine and you cant even smell them! and i love the fact that you dont have to open it to put a diaper in so i can do it one handed.  Just toss the diaper in the top and flip over the handle, its sooo easy!  And with the diaper genie i noticed that the smell would leak out a bit when you opened the top.  It is a little bigger than the other pails, but it holds alot more!  I would definatly recommend this product to anyone looking to buy a diaper pail!!!!!'

In [60]:
diaper_champ_reviews[-1]['review']


Out[60]:
'My husband and I selected the Diaper "Champ" mainly because you can use ordinary trash bags and not be roped into buying the specialty refill bags, and it was moderately priced (a little less than the Diaper Dekor). It also seemed that the reviews of this product were generally more positive...The positives are:1. You can use any trash bag2. Easy to use and refillThe negatives are:1. The bag doesn\'t seal around the dirty diapers, so when it comes time to refill the bag, it\'s just like opening a regular trash can. Smells like the Champ is trying to knock YOU out with odor!2. The plastic seems to smell, ie. You put a dirty diaper in the hole, and flip the handle to dump the diaper into the champ. That "side" of the plastic dumper-thingie is in contact with the air inside the dirty diaper changer, so when you flip it over the next time to dispose of another diaper, you smell the last 8 diapers you put in there...pretty gross.3. The "odor seal" (some soft material) really seems to retain odor. It cannot be washed or replaced, so after a while, the Diaper "Champ" smells even when freshly washed and deodorized (I\'m talking about hosing down outside and scrubbing with Clorox cleanser!). Super-frustrating! This is my primary complaint.Okay, so some things are a given as far as disposal systems go (ie. *some* odor, must empty frequently, must wash and disinfect occasionally), but still I think this product leaves much to be desired.We\'re going to try another disposal system.'